# 04. Fixed Q-Targets

## Summary

In Q-Learning, we update a guess with a guess: the TD target is computed from the same weights w that we are adjusting, so every update shifts the target itself, and this can potentially lead to harmful correlations. To avoid this, we can update the parameters w in the network \hat{q} to better approximate the action value corresponding to state S and action A with the following update rule:

\Delta w = \alpha \cdot \overbrace{( \underbrace{R + \gamma \max_a\hat{q}(S', a, w^-)}_{\rm {TD~target}} - \underbrace{\hat{q}(S, A, w)}_{\rm {old~value}})}^{\rm {TD~error}} \nabla_w\hat{q}(S, A, w)

where w^- are the weights of a separate target network that are not changed during the learning step, and (S, A, R, S') is an experience tuple.
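To make the rule concrete, here is a minimal PyTorch-style sketch of one learning step with fixed Q-targets. Everything in it is illustrative rather than a reference implementation: the QNetwork architecture, the learn helper, and the batch layout (column tensors for actions and rewards, plus a float done mask) are assumptions for this example.

```python
import torch
import torch.nn as nn

# Hypothetical Q-network; the same architecture is used for the online and target networks.
class QNetwork(nn.Module):
    def __init__(self, state_size, action_size, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_size, hidden), nn.ReLU(),
            nn.Linear(hidden, action_size),
        )

    def forward(self, state):
        return self.net(state)

def learn(q_local, q_target, optimizer, batch, gamma=0.99):
    """One gradient step using fixed Q-targets.

    `batch` is an (S, A, R, S', done) tuple of tensors; the target network
    q_target is held fixed during this step.
    """
    states, actions, rewards, next_states, dones = batch

    # TD target: R + gamma * max_a q_hat(S', a, w^-), computed with the frozen target weights.
    with torch.no_grad():
        q_next = q_target(next_states).max(dim=1, keepdim=True)[0]
        td_target = rewards + gamma * q_next * (1 - dones)

    # Old value: q_hat(S, A, w) from the online network.
    q_expected = q_local(states).gather(1, actions)

    # Minimizing the squared TD error moves w in the direction of the update rule above
    # (up to a constant factor from the derivative of the square).
    loss = nn.functional.mse_loss(q_expected, td_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```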

Note: Ever wondered how the example in the video would look in real life? See: Carrot Stick Riding.

## Quiz

Which of the following are true? Select all that apply.

SOLUTION:
  • The Deep Q-Learning algorithm uses two separate networks with identical architectures.
  • The target Q-Network's weights are updated less often (or more slowly) than the primary Q-Network, as sketched after this quiz.
  • Without fixed Q-targets, we would encounter a harmful form of correlation, whereby we shift the parameters of the network based on a constantly moving target.
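The second statement refers to how the target network's weights w^- lag behind the primary network's weights w. One common way to arrange this, shown as a hedged sketch with hypothetical helper names and an illustrative tau value, is to either copy the weights every fixed number of learning steps or blend them slowly with a soft update:

```python
def hard_update(q_local, q_target):
    # Copy the primary network's weights into the target network every C learning steps.
    q_target.load_state_dict(q_local.state_dict())

def soft_update(q_local, q_target, tau=1e-3):
    # Blend the target weights toward the primary weights: w^- <- tau * w + (1 - tau) * w^-.
    for target_param, local_param in zip(q_target.parameters(), q_local.parameters()):
        target_param.data.copy_(tau * local_param.data + (1.0 - tau) * target_param.data)
```

Either scheme keeps the TD target from moving on every single update, which is what breaks the harmful correlation described in the third statement.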